CAPSTONE FOR DATA SCIENCE CERTIFICATION

Analysis of Barcelona's hostelry, hospitality and recreational sector for supply and service industry development


Introduction

This try to be a parcial market study about hospitality sector in Barcelona's city. It's also included the hostelry and recreational bussiness like gymns, in order to support the auxiliar service and utilities industry that supplay to those mentioned main industries.

Target market for the study

For instance, catering companies that work serving to hotels or restaurants, cleaning and desinfection companies, laundries, kitchenware suppliers, fresh food, etc.. all the companies that give services or supply goods for hotels, hostals, restaurants, and/or gyms in Barcelona

Area of study

All the mentioned services's and utilities companies above need to know where fisicaly their possible objetive clients are. Also, they need to create areas dividing the city in order to give the best posible service for their clients. That's why, this studie try to use the known data bases of Barcelona to divide the city in areas and get a possible location for their warehouse or centralitation office for each area.

Methodology

It is possible to use K-Mean clustering to divide the city in areas or groups because K-Mean try to group the elements with the less posible distance between elements of the group but getting the highest distance between each group. Furthermore, the K-Mean method provide the centroids, it means, a center point of the group where idealy can be localized an operational centralitation like warehouse or office to operate each group. Instead of use an Euclidean space, we can use longitude and latitude as a Cartesian space for K-Means.

Data sources

Barcelona's opendata repository

https://opendata-ajuntament.barcelona.cat/

In the link above it's posible to find a large source of oficial information from Barcelon's Council. Thanks to this open data availability , it's why I decided to focuse the study in Barcelona's city. The dataset provides the information in wide range of formats. The acaccessibility can be directly through the links in the web or through their different API's. A better explanation of the API management can be found on the link below :

https://opendata-ajuntament.barcelona.cat/en/desenvolupadors

Foursquare API

Because in the council dataset, the information about potential recreational business like gyms are mixed with other sectors and the information is not easy to handle, in the study will be used Foursquare dataset to get the closed gyms to the centroids based on their distance. To get it, the Python Foursquare pakage will be used to get an easy access.

https://pypi.org/project/foursquare/


Data Processing

Required libraries

python version 	:	 3.8.6
pandas version 	:	 1.1.4
numpy version 	:	 1.19.3
folium version 	:	 0.11.0
matplotlib version 	:	 3.3.2
foursquare version 	:	 1!2020.1.30
sklearn version 	:	 0.23.2
geopy version 	:	 2.0.0

Data collection

Please, for further information, see the link below to get an explanation of how to get the information from Barcelona's open data repository. For this study it's intreating the information of gesthouses, hostels, hotels and restaurants. That information is available in different formats, in this case, the CVS format will be selected. The code will generate a pandas dataframe for each table and print the name of the dataframe and its shape to get an idea about how long the table it's.

https://opendata-ajuntament.barcelona.cat/en/desenvolupadors

DataFrame guesthouse shape :   (580, 33)
DataFrame hostel shape :   (564, 33)
DataFrame hotel shape :   (884, 33)
DataFrame restaurants shape :   (16936, 33)
DataFrame Neighborhood_BCN shape :   (73, 4)

Format tables and depreciation of unnecessary columns

Accommodations

"guesthouse", "hostel" and "hotel" tables contain the same structure so all of them can be transformed and join in a single dataframe

A "Category" column is created to define the diferent type of business to keep that information after being joined

In "...NIVELL" columns it's posible to find which category of accommodation it is. For instance, "3ER_NIVELL" column contains the hotel category stars, "2N_NIVELL" column contains if the guesthouse is an apartment , students residency, young hostel or other type. For hostels there isn't any farther information. What it's propoused is to map that information with numbers from 10 to 0. On one hand that information can be categorized, making luxury hotels more interesting than young hostels.On the other hand, it's better for an easier access in further analysis.

Definitions :

  • For hotels, stars +10 --> from 0 starts to 5 starts = 5 to 10.
  • hostels =4
    • guesthouse;
      • Apartments =3
      • student housing and University residences =2
      • young hostel = 1
      • not defined categories = 0

Now the tables are joined in one dataframe called 'accommodation'

                                      accommodations dataframe                                      
CODI_EQUIPAMENT EQUIPAMENT SECCIO TIPUS_VIA NOM_CARRER NUM_CARRER_1 NUM_CARRER_2 CODI_BARRI NUM_BARRI CODI_DISTRICTE ... HORARI_DIES HORARI_HORES_INICI HORARI_HORES_FI HORARI_OBSERVACIONS 3ER_NIVELL 2N_NIVELL 1ER_NIVELL Unnamed: 32 CATEGORY LEVEL
0 146175914 Residència Universitària Lesseps # Pl Lesseps 12 12.0 31 la Vila de Gràcia 6 ... NaN NaN NaN NaN NaN Residències d'estudiants Allotjament NaN guesthouse 2
1 146175914 Residència Universitària Lesseps # Pl Lesseps 12 12.0 31 la Vila de Gràcia 6 ... NaN NaN NaN NaN NaN Residències d'estudiants Allotjament NaN guesthouse 2
2 146175914 Residència Universitària Lesseps # Pl Lesseps 12 12.0 31 la Vila de Gràcia 6 ... NaN NaN NaN NaN NaN Residències d'estudiants Allotjament NaN guesthouse 2

3 rows × 35 columns

Unnecessary columns and possible dupicated lines are droped

Columns before cleaning :  Index(['CODI_EQUIPAMENT', 'EQUIPAMENT', 'SECCIO', 'TIPUS_VIA', 'NOM_CARRER',
       'NUM_CARRER_1', 'NUM_CARRER_2', 'CODI_BARRI', 'NUM_BARRI',
       'CODI_DISTRICTE', 'NOM_DISTRICTE', 'CODI_POSTAL', 'CODI_POBLACIO',
       'POBLACIO', 'LATITUD', 'LONGITUD', 'X_ETRS89', 'Y_ETRS89', 'X_ED50',
       'Y_ED50', 'TELEFON_NUM', 'TELEFON_TIPUS', 'TELEFON_INFO_COM',
       'HORARI_PERIODE_INICI', 'HORARI_PERIODE_FI', 'HORARI_DIES',
       'HORARI_HORES_INICI', 'HORARI_HORES_FI', 'HORARI_OBSERVACIONS',
       '3ER_NIVELL', '2N_NIVELL', '1ER_NIVELL', 'Unnamed: 32', 'CATEGORY',
       'LEVEL'],
      dtype='object') 
Shape :  (2028, 35)
Columns after cleaning :  Index(['CODI_EQUIPAMENT', 'EQUIPAMENT', 'CODI_BARRI', 'NUM_BARRI',
       'CODI_DISTRICTE', 'NOM_DISTRICTE', 'CODI_POSTAL', 'LATITUD', 'LONGITUD',
       'CATEGORY', 'LEVEL'],
      dtype='object') 
Shape :  (1031, 11)

And print the dataframe

accommodations dataframe
(1031, 11)
CODI_EQUIPAMENT EQUIPAMENT CODI_BARRI NUM_BARRI CODI_DISTRICTE NOM_DISTRICTE CODI_POSTAL LATITUD LONGITUD CATEGORY LEVEL
0 146175914 Residència Universitària Lesseps 31 la Vila de Gràcia 6 Gràcia 8023 41.407168 2.151154 guesthouse 2
3 146180329 Residència Universitària La Ciutadella 66 el Parc i la Llacuna del Poblenou 10 Sant Martí 8018 41.391563 2.186115 guesthouse 2
6 75990016464 Col.legi Major Universitari Lestonnac 7 la Dreta de Eixample 2 Eixample 8007 41.392634 2.165757 guesthouse 2

Restaurants

Drop duplicated lines unnecessary columns and print

Restaurants columns :  Index(['CODI_EQUIPAMENT', 'EQUIPAMENT', 'SECCIO', 'TIPUS_VIA', 'NOM_CARRER',
       'NUM_CARRER_1', 'NUM_CARRER_2', 'CODI_BARRI', 'NUM_BARRI',
       'CODI_DISTRICTE', 'NOM_DISTRICTE', 'CODI_POSTAL', 'CODI_POBLACIO',
       'POBLACIO', 'LATITUD', 'LONGITUD', 'X_ETRS89', 'Y_ETRS89', 'X_ED50',
       'Y_ED50', 'TELEFON_NUM', 'TELEFON_TIPUS', 'TELEFON_INFO_COM',
       'HORARI_PERIODE_INICI', 'HORARI_PERIODE_FI', 'HORARI_DIES',
       'HORARI_HORES_INICI', 'HORARI_HORES_FI', 'HORARI_OBSERVACIONS',
       '3ER_NIVELL', '2N_NIVELL', '1ER_NIVELL', 'Unnamed: 32'],
      dtype='object') 
Shape :  (16909, 33)
Restaurants columns after cleaning :  Index(['CODI_EQUIPAMENT', 'EQUIPAMENT', 'CODI_BARRI', 'NUM_BARRI',
       'CODI_DISTRICTE', 'NOM_DISTRICTE', 'CODI_POSTAL', 'LATITUD',
       'LONGITUD'],
      dtype='object') 
Shape :  (2469, 9)

And print the first lines of the table

                                       restaurants dataframe                                        
CODI_EQUIPAMENT EQUIPAMENT CODI_BARRI NUM_BARRI CODI_DISTRICTE NOM_DISTRICTE CODI_POSTAL LATITUD LONGITUD
0 53170444 Restaurant Tapes Tapas *Poble Espanyol 11 el Poble-sec 3 Sants-Montjuïc 8038 41.369148 2.146751
9 94135210 Restaurant Japonès Ginza 8 l'Antiga Esquerra de l'Eixample 2 Eixample 8008 41.392182 2.157920
15 140145849 Hotel Rívoli Ramblas - HB-003878 2 el Barri Gòtic 1 Ciutat Vella 8002 41.384315 2.171183

Visualization

Visualization of the information on maps

For display the information on the map, it's used folium technology

In this map we see all the restaurants and accomodations in the city. In the control layer can be selected the differnt layers and the marks has been clustered for a clear visualization.

Make this Notebook Trusted to load map: File -> Trust Notebook

Let's see the information in a choropleth map. In one map are included the accomodations and it's density by neighborhoods. The next one contain the restaurants and density by neighborhoods, also. For that, it's prepared a table grouping the elements by neighborhood.

Make this Notebook Trusted to load map: File -> Trust Notebook
Make this Notebook Trusted to load map: File -> Trust Notebook

Analysis

As we can see, most of the restaurans and accomodations are concentred in two or three neighborhood arround the city center:

Neighborhoods with more accommodations
Neighb_code Accom_quantity District_code District_name Neighb_Name
6 7 253 2 Eixample la Dreta de l'Eixample
1 2 112 1 Ciutat Vella el Barri Gòtic
7 8 91 2 Eixample l'Antiga Esquerra de l'Eixample
0 1 86 1 Ciutat Vella el Raval
30 31 40 6 Gràcia la Vila de Gràcia
                    
Neighborhoods with more restaurants
Neighb_code Restaur_quantity District_code District_name Neighb_Name
6 7 291 2 Eixample la Dreta de l'Eixample
7 8 265 2 Eixample l'Antiga Esquerra de l'Eixample
1 2 159 1 Ciutat Vella el Barri Gòtic
30 31 159 6 Gràcia la Vila de Gràcia
25 26 151 5 Sarrià-Sant Gervasi Sant Gervasi - Galvany

Those companies which doesn't want to provide services to all of the possible clients (restaurants and accomodations) in all over the city, better will be concrentrated on 'Eixample' and 'Ciutat Vella' districts.

But in this study we want to analyze those companies which aim to attend all ppossible clients in the city. To capture such a large amount of clients, these companies need to divide their resources and create working groups in order to provide a better service. Imagine that each group is in charge of a different area of the city and each area needs a work-center (warehouse or office) to support the group. Each area needs to capture the maximum possible clients with the less distance between them.

For this task is possible to use KMeans. It will divide the clients in the number of groups that we want and provide the centroids where to locate the workin-centers. Each group will optimize the resources

First of all it's necesary to determine the number of areas. So we can use the Inertia to evaluate wich number of groups will be better. let's see what is the inertia dividing the city between 2 to 8 groups for accomodations, and the same for restaurants:

GROUPS OF ACCOMMODATIONS
inertia for 2 groups :  0.251
inertia for 3 groups :  0.185
inertia for 4 groups :  0.142
inertia for 5 groups :  0.117
inertia for 6 groups :  0.1
inertia for 7 groups :  0.087
inertia for 8 groups :  0.074
variance between 2 and 3 groups :  0.066
variance between 3 and 4 groups :  0.043
variance between 4 and 5 groups :  0.025
variance between 5 and 6 groups :  0.016
variance between 6 and 7 groups :  0.014
variance between 7 and 8 groups :  0.012
----------------------------------------------
GROUPS OF RESTAURANTS
inertia for 2 groups :  0.829
inertia for 3 groups :  0.577
inertia for 4 groups :  0.447
inertia for 5 groups :  0.364
inertia for 6 groups :  0.298
inertia for 7 groups :  0.255
inertia for 8 groups :  0.22
variance between 2 and 3 groups :  0.253
variance between 3 and 4 groups :  0.13
variance between 4 and 5 groups :  0.083
variance between 5 and 6 groups :  0.066
variance between 6 and 7 groups :  0.043
variance between 7 and 8 groups :  0.035
----------------------------------------------
GROUPS OF ACCOMODATIONS AND RESTAURANTS TOGETHER
inertia for 2 groups :  1.107
inertia for 3 groups :  0.767
inertia for 4 groups :  0.597
inertia for 5 groups :  0.493
inertia for 6 groups :  0.413
inertia for 7 groups :  0.354
inertia for 8 groups :  0.304
variance between 2 and 3 groups :  0.253
variance between 3 and 4 groups :  0.13
variance between 4 and 5 groups :  0.083
variance between 5 and 6 groups :  0.066
variance between 6 and 7 groups :  0.043
variance between 7 and 8 groups :  0.035

We can observe that applying KMean on lower quantity of elements, generate lower inertia. It means, considering only the accomodations will generate less inertia than the restaurants or all together because there are less elements. However, the sum of the individual inertias is higher than the inertia of all together .

So we will consider all the elements together and not separete restaurants and hotels
Secondly, we observe that as higher is the number of the groups, lower is the inertia (something logical). But the difference betwbetween 7 to 8 groups it's very low and considering that as more groups we do, less resources we have because the resorcess should be distributed by groups. It's necesary to find the less groups with a better performance.
For that reason, we will considere a division in 6 groups

Let's apply the kMeans analysis for all elements in 6 groups and generatea a table including all the elements and a new column with the number of the group to which they belong to.

all elements with assigned area
(3500, 12)
CODI_EQUIPAMENT EQUIPAMENT CODI_BARRI NUM_BARRI CODI_DISTRICTE NOM_DISTRICTE CODI_POSTAL LATITUD LONGITUD CATEGORY LEVEL GROUP
0 146175914 Residència Universitària Lesseps 31 la Vila de Gràcia 6 Gràcia 8023 41.407168 2.151154 guesthouse 2.0 4
1 146180329 Residència Universitària La Ciutadella 66 el Parc i la Llacuna del Poblenou 10 Sant Martí 8018 41.391563 2.186115 guesthouse 2.0 0
2 75990016464 Col.legi Major Universitari Lestonnac 7 la Dreta de Eixample 2 Eixample 8007 41.392634 2.165757 guesthouse 2.0 4
3 75990020227 Residencia Madre Isabel Larrañaga 24 les Tres Torres 5 Sarrià-Sant Gervasi 8017 41.397669 2.128852 guesthouse 2.0 2
4 75990020841 Residencia e Institución Javeriana 4 Sant Pere, Santa Caterina i la Ribera 1 Ciutat Vella 8003 41.385713 2.177711 guesthouse 2.0 0

Now we can divide that long dataframe in 6 dataframes one for each group:

shape of dataframe "group_0" :  (1117, 12)
shape of dataframe "group_1" :  (452, 12)
shape of dataframe "group_2" :  (517, 12)
shape of dataframe "group_3" :  (184, 12)
shape of dataframe "group_4" :  (974, 12)
shape of dataframe "group_5" :  (256, 12)

à Voilà!! Here we can see all the restaurants and accomodations of Barcelona clustered in 6 groups by K-Mean.

And the centroids, with their locations

centroid for group 0 : 41.38549036021505, 2.1756782096774194 
	location: 4, Carrer del Doctor Joaquim Pou, el Gòtic, Ciutat Vella, Barcelona, Barcelonès, Catalunya, 08002, España
centroid for group 1 : 41.37808499559471, 2.156656528634361 
	location: 65, Carrer de Calàbria, Project Area Sant Antoni, Font Trobada, Barcelona, Barcelonès, Catalunya, 08015, España
centroid for group 2 : 41.39092840038685, 2.1329128065764023 
	location: 10, Carrer de la Caravel·la La Niña, les Corts, Barcelona, Barcelonès, Catalunya, 08017, España
centroid for group 3 : 41.42713067391304, 2.1785231576086956 
	location: 4, Carrer de Sant Pasqual Bailón, el Congrés, el Congrés i els Indians, Sant Andreu, Barcelona, Barcelonès, Catalunya, 08027, España
centroid for group 4 : 41.3966167995889, 2.158021282631038 
	location: Església i convent de Pompeia, Carrer de la Riera de Sant Miquel, la Vila de Gràcia, Gràcia, Barcelona, Barcelonès, Catalunya, 08001, España
centroid for group 5 : 41.4009648828125, 2.19924386328125 
	location: Schindler. Liberty Seguros. Regal, Carrer de Pujades, el Poblenou, Sant Martí, Barcelona, Barcelonès, Catalunya, 08001, España

Foursquare to find Gyms

As commented in the introduction of the article , we can include the gyms, too. Because they have some similar requirements of providers and services as accomodations or restaurants. Information about gyms is possible to be found in Barcelona's open data repository. But it's mixed with other business, so easily it can be provided by Foursquare. For this process it's used the Foursquare's API which can found in pip repository.

<foursquare.Foursquare at 0x1b27940cac0>

Now we can get the closest gyms to the centroids as possible clients to be study. Due to the limitations of Foursquare , a limit of 50 items is settled for each group and a radious of 1.5km arroun the center point. The result are achieved in separated dataframes for each group

DataFrame Gyms_group_0 shape :   (49, 8)
DataFrame Gyms_group_1 shape :   (50, 8)
DataFrame Gyms_group_2 shape :   (48, 8)
DataFrame Gyms_group_3 shape :   (44, 8)
DataFrame Gyms_group_4 shape :   (49, 8)
DataFrame Gyms_group_5 shape :   (49, 8)

Below we can see two examples of the dataframes:

                                      DataFrame  Gyms_group_5                                       
NAME LOCATION LATITUD LONGITUD DISTANCE_CENTOID ZIP NEIGHBORHOOD GROUP
0 Àgora (Psicoterapia y Desenvolvimiento Humano) Calle Pujades 41.399024 2.197441 263 08005 NaN 5
1 Barcelona Martial Arts Academy (BCNMA) Roc Boronat, 66 41.400230 2.198661 95 NaN NaN 5
2 Corporació Fisiogestion Carrer de la Ciutat de Granada, 115 41.401710 2.193483 488 08018 NaN 5
3 Feeling Woman Pere IV 29-35, 3º 5ª 41.394764 2.190814 985 08018 NaN 5
4 Duet Fit 208 Avinguda Diagonal 41.404581 2.191114 789 08018 NaN 5
                                      DataFrame  Gyms_group_3                                       
NAME LOCATION LATITUD LONGITUD DISTANCE_CENTOID ZIP NEIGHBORHOOD GROUP
0 Duet Sports Coxteres Borbó (Oficial). C/ Ramón Albó, 41-57 41.426602 2.178902 66 08027 NaN 3
1 Gym Guinardó Carrer del Telègraf, 33-65 41.416792 2.173619 1221 08041 NaN 3
2 Holmes Place Avinguda de Rio de Janeiro, 42 41.435380 2.180947 940 08016 NaN 3
3 Crossfit Sant Andreu 71 Calle del Segre 41.431867 2.194215 1411 08030 NaN 3
4 Club Natació Sant Andreu Rambla Fabra i Puig 45-47 41.429418 2.186491 712 08030 NaN 3

Let's see the information on a map:

Make this Notebook Trusted to load map: File -> Trust Notebook

Conclusion

As we can see, K-Mean can be used also to distribute a big group of phisical points in the space in small groups which get the less distance between the elements into each group. This can be useful for courier companies, distribution, logistics, etc..